
    Replacing 6T SRAMs with 3T1D DRAMs in the L1 data cache to combat process variability

    With continued technology scaling, process variations will be especially detrimental to six-transistor static memory structures (6T SRAMs). A memory architecture using three-transistor, one-diode DRAM (3T1D) cells in the L1 data cache tolerates wide process variations with little performance degradation, making it a promising choice for on-chip cache structures in next-generation microprocessors.

    Weightless: Lossy Weight Encoding For Deep Neural Network Compression

    The large memory requirements of deep neural networks limit their deployment and adoption on many devices. Model compression methods effectively reduce the memory requirements of these models, usually by applying transformations such as weight pruning or quantization. In this paper, we present a novel scheme for lossy weight encoding which complements conventional compression techniques. The encoding is based on the Bloomier filter, a probabilistic data structure that can save space at the cost of introducing random errors. By leveraging the ability of neural networks to tolerate these imperfections and re-training around the errors, the proposed technique, Weightless, can compress DNN weights by up to 496x with the same model accuracy. This results in up to a 1.51x improvement over the state of the art.
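
    As a concrete illustration, the sketch below builds a minimal exact Bloomier-filter-style table using the standard XOR/peeling construction. It is a hedged Python sketch, not the paper's implementation: the hash scheme, the choice of k=3 slots per key, the 16-bit mask width, and all function names are assumptions. Weightless additionally exploits the filter's lossy behavior (queries for keys outside the stored set return arbitrary values) and re-trains the network around the resulting errors.

        import random

        def _slots_and_mask(key, m, k, seed):
            # Derive k distinct table slots and a per-key mask (illustrative
            # hashing; a real implementation would use stronger hash functions).
            rnd = random.Random(hash((key, seed)))
            idxs = rnd.sample(range(m), k)
            mask = rnd.getrandbits(16)
            return idxs, mask

        def build(pairs, m, k=3, seed=0):
            """Encode {key: value} into a table of m ints; None if peeling fails."""
            key_idxs = {key: _slots_and_mask(key, m, k, seed)[0] for key, _ in pairs}
            touching = {i: set() for i in range(m)}   # slot -> keys hashing there
            for key, idxs in key_idxs.items():
                for i in idxs:
                    touching[i].add(key)
            # Peel keys that own a slot no other remaining key touches.
            stack = []
            candidates = [i for i in range(m) if len(touching[i]) == 1]
            while candidates:
                i = candidates.pop()
                if len(touching[i]) != 1:
                    continue                          # stale candidate
                (key,) = touching[i]
                stack.append((key, i))
                for j in key_idxs[key]:
                    touching[j].discard(key)
                    if len(touching[j]) == 1:
                        candidates.append(j)
            if len(stack) != len(pairs):
                return None                           # caller retries with a new seed
            # Assign in reverse peel order so each key's home slot stays final.
            table, values = [0] * m, dict(pairs)
            for key, home in reversed(stack):
                idxs, mask = _slots_and_mask(key, m, k, seed)
                acc = values[key] ^ mask
                for j in idxs:
                    if j != home:
                        acc ^= table[j]
                table[home] = acc
            return table

        def query(table, key, k=3, seed=0):
            idxs, mask = _slots_and_mask(key, len(table), k, seed)
            acc = mask
            for j in idxs:
                acc ^= table[j]
            return acc

        weights = list(enumerate([3, 7, 1, 0, 5]))    # weight index -> quantized value
        for seed in range(100):                       # retry until peeling succeeds
            table = build(weights, m=8, seed=seed)
            if table is not None:
                break
        assert all(query(table, i, seed=seed) == v for i, v in weights)

    Note the space saving: five stored values occupy only eight small table entries, and the structure degrades gracefully, which is the property the paper's re-training step relies on.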

    S³: Increasing GPU Utilization during Generative Inference for Higher Throughput

    Generating text with a large language model (LLM) consumes massive amounts of memory. Apart from the already-large model parameters, the key/value (KV) cache that holds information about previous tokens in a sequence can grow even larger than the model itself. This problem is exacerbated in one current LLM serving framework, which reserves memory for the maximum sequence length in the KV cache to guarantee that a complete sequence can be generated, since the output sequence length is not known in advance. This restricts serving to smaller batch sizes, leading to lower GPU utilization and, above all, lower throughput. We argue that designing a system with a priori knowledge of the output sequence length can mitigate this problem. To this end, we propose S³, which predicts the output sequence length, schedules generation queries based on the prediction to increase device resource utilization and throughput, and handles mispredictions. Our proposed method achieves 6.49× the throughput of systems that assume the worst case for the output sequence length.
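
    To see why worst-case reservation hurts, the back-of-the-envelope sketch below compares worst-case KV-cache reservation with the kind of length-aware packing S³ enables. All numbers (model shape, memory budget, predicted lengths) are illustrative assumptions, not figures from the paper, and a real scheduler must also handle sequences that outgrow their prediction, e.g. by preempting and requeueing them.

        # Assumed numbers: a 13B-class fp16 model with 40 layers and hidden
        # size 5120 needs about 2 (K and V) * layers * hidden * 2 bytes
        # ~= 0.8 MB of KV cache per token.
        LAYERS, HIDDEN, BYTES = 40, 5120, 2
        KV_PER_TOKEN = 2 * LAYERS * HIDDEN * BYTES    # bytes per token
        MAX_LEN = 2048                                # maximum sequence length
        BUDGET = 24 * 2**30                           # 24 GiB left for the KV cache

        # Worst-case reservation: every sequence is assumed to reach MAX_LEN.
        worst_case_batch = BUDGET // (KV_PER_TOKEN * MAX_LEN)

        # Length-aware packing: reserve only each query's predicted length.
        # Hypothetical predicted lengths for 64 queued generation queries:
        predicted = sorted([180, 90, 400, 60, 250, 120, 300, 80] * 8)
        packed, used = 0, 0
        for n in predicted:
            if used + KV_PER_TOKEN * n > BUDGET:
                break
            used += KV_PER_TOKEN * n
            packed += 1

        print(f"worst-case batch: {worst_case_batch}, length-aware batch: {packed}")
        # -> worst-case batch: 15, length-aware batch: 64 (with these assumptions)

    With these assumed numbers, predicting lengths lets roughly four times as many sequences share the same memory, which is exactly the utilization gap the paper targets.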

    PulseNet - A Parallel Flash Sampler and Digital Processor IC for Optical SETI

    PulseNet is a full-custom IC with a parallel flash ADC and digital processing that enables an all-sky optical search for extraterrestrial intelligence. It integrates 448 sense amplifiers that digitize 32 analog signals at 1 GS/s, and other circuits that filter samples, store candidate signals, and support astronomical observations. Its ~250,000 CMOS transistors (TSMC 0.25µm) dissipate 1.1W at 400MHz and 2.5V.
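
    For intuition about the sampler, here is a toy model of one flash-ADC channel as a thermometer-coded comparator bank; 448 sense amplifiers across 32 channels works out to 14 reference levels per channel. The threshold spacing and input scale below are illustrative assumptions, not the chip's design values.

        def flash_sample(v, thresholds):
            # Each comparator fires if the input exceeds its reference level;
            # the output code is the count of fired comparators (thermometer code).
            return sum(v > t for t in thresholds)

        thresholds = [i / 14 for i in range(1, 15)]   # 14 evenly spaced references
        samples = [flash_sample(v, thresholds) for v in (0.03, 0.5, 0.97)]
        print(samples)  # -> [0, 6, 13]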